Introduction

immunarch is an R package designed to analyse TCR and BCR (immunoglobulin) repertoires, which constitute a large amount of data. The mission of immunarch is to make immune sequencing data analysis as effortless as possible—and help you focus on research instead of coding.

Features

  1. Fast and easy manipulation of immune repertoire data:

    • The package automatically detects the format of your files—no more guessing what format is that file, just pass them to the package;

    • Supports all popular TCR and BCR analysis and post-analysis formats: ImmunoSEQ, IMGT, MiTCR, MiXCR, MiGEC, MigMap, VDJtools, tcR. More coming in the future;

    • Works on any data source you are comfortable with: R data frames, data tables from data.table, databases like MonetDB, Apache Spark data frames via sparklyr.

  2. Immune repertoire analysis made simple:

    • Most methods are incorporated in a couple of main functions with clear naming—no more remembering tens and tens of functions with obscure names;

    • Repertoire overlap analysis (common indices including overlap coefficient, Jaccard index and Morisita’s overlap index)

    • Gene usage estimation (correlation, Jensen-Shannon Divergence, clustering)

    • Diversity evaluation (ecological diversity index, Gini index, inverse Simpson index, rarefaction analysis)

    • Coming in the next releases: CDR3 amino acid physical and chemical properties assessment, Kmer distribution measures and statistics, mutation networks, tracking clonotypes across time points

  • Publication-ready plots with a built-in tool for tweaking visualisations:

    • Rich visualisation procedures with ggplot2;

    • Built-in tool fixVis makes your plots publication-ready: easily change font sizes, text angles, titles, legends and many more with clear-cut GUI;

Installation

You can find the list of releases of immunarch here: https://github.com/immunomind/immunarch/releases

In order to install immunarch, you need to download it first. If you want to download the latest version, you need to download the package file, available here https://github.com/immunomind/immunarch/releases/download/latest/immunarch.tar.gz

Note that You should not un-archive it!

After downloading the file, you need to install a number of packages with R commands listed below, and run the newly installed devtools package to install immunarch locally. Upon completion the dependencies will have been already downloaded and installed.

install.packages("devtools", dependencies = T)
devtools::install_local("path/to/your/folder/with/immunarch.tar.gz", dependencies=T)

That’s it, you can start using immunarch now!

Installation troubleshooting

If you run in any trouble, try the following steps:

  1. Check your R version. Run version command in the console to get your R versions. If the R version is below 3.4.0 (for example, R version 3.1.0), try updating your R version to the latest one.

  2. Check if your packages are outdated and update them. In RStudio you can run the “Update” button on top of the package list. In R console you can run the old.packages() command to view a list of outdated packages.

  3. If you are on Mac and have issues like old packages can’t be updated, or error messages such as ld: warning: directory not found for option or ld: library not found for -lgfortran, this link will help you to fix the issue.

  4. If you are working under Linux and have issues with library or have Fortran errors, see this link

  5. If troubles still persist, message us on support@immunomind.io or create an issue in https://github.com/immunomind/immunarch/issues with the code that represents the issue and the output you get in the console.

Quick start

Importing data into R is fairly simple. The gist of the typical TCR or BCR explorational data analysis workflow can be reduced to the next few lines of code:

# Load the data to the package
immdata = repLoad("path/to/your/folder/with/repertoires")
# If you folder contains metadata.txt file, immdata will have two elements:
# - immdata$data with a list of parsed repertoires
# - immdata$meta with the metadata file

# Compute and visualise overlap statistics
ov = repOverlap(immdata$data)
vis(ov)

# Cluster samples using K-means algorithm applied to the number of overlapped clonotypes
# and visualise the results
ov.kmeans = repOverlapAnalysis(ov, .method = "kmeans")
vis(ov.kmeans)

# Compute and visualise gene usage with samples, grouped by their disease status
gu = geneUsage(immdata$data)
vis(gu, .by="Status", .meta=immdata$meta)

# Compute Jensen-Shannon divergence among gene distributions of samples, 
# cluster samples using the hierarchical clustering and visualise the results
gu.clust = geneUsageAnalysis(gu, .method = "js+hclust")
vis(gu.clust)

# Compare diversity of repertoires and visualise samples, grouped by two parameters
div = repDiversity(immdata$data, .method = "chao1")
vis(div, .by=c("Status", "Treatment"), .meta=immdata$meta)

# Tweak and fix the visalisation of diversity estimates to make the plot publication-ready
div.plot = vis(div, .by=c("Status", "Treatment"), .meta=immdata$meta)
fixVis(div.plot)

If you want to test the package without parsing any data, you can load a small test dataset provided along with the package. Load the data with the following command:

data(immdata)

Working with data in immunarch

Immunarch data format

immunarch comes with its own data format, including tab-delimited columns that can be specified as follows:

  • “Clones” - count or number of barcodes (events, UMIs) or reads;

  • “Proportion” - proportion of barcodes (events, UMIs) or reads;

  • “CDR3.nt” - CDR3 nucleotide sequence;

  • “CDR3.aa” - CDR3 amino acid sequence;

  • “V.name” - names of aligned Variable gene segments;

  • “D.name” - names of aligned Diversity gene segments or NA;

  • “J.name” - names of aligned Joining gene segments;

  • “V.end” - last positions of aligned V gene segments (1-based);

  • “D.start” - positions of D’5 end of aligned D gene segments (1-based);

  • “D.end” - positions of D’3 end of aligned D gene segments (1-based);

  • “J.start” - first positions of aligned J gene segments (1-based);

  • “VJ.ins” - number of inserted nucleotides (N-nucleotides) at V-J junction (-1 for receptors with VDJ recombination);

  • “VD.ins” - number of inserted nucleotides (N-nucleotides) at V-D junction (-1 for receptors with VJ recombination);

  • “DJ.ins” - number of inserted nucleotides (N-nucleotides) at D-J junction (-1 for receptors with VJ recombination);

  • “Sequence” - full nucleotide sequence.

Input / output

The package provides several IO functions:

  • repLoad - to load the repertoires, having compatible format.

  • repSave - to save changes and to write the repertoire data to a file in a specific format (immunarch, VDJtools or MonetDB).

repLoad has the .format argument that sets the format for input repertoire files. Do not provide it if you want immunarch to detect the formats and parse files automatically! In case you want to force the package to parse the data in a specific format, you can choose one of the several options for the argument:

These parsers will be available soon.

Please contact us if there are more file formats you want to be supported.

For parsing IgBLAST results process the data with MigMap first.

You can load the data either from a single file or from a folder with repertoire files. A single file can be loaded as follows:

# To load the data from a single file without forcing the data format:
immdata <- repLoad("path/to/your/folder/immunoseq_1.txt")

# To load the data from a single ImmunoSEQ file go with:
immdata <- repLoad("path/to/your/folder/immunoseq_1.txt", .format = "immunoseq")

In the second case you may want to provide a metadata file and locate it in the folder. It is necessary to name it exactly “metadata.txt”.

# For instance you have a following structure in your folder:
# >_ ls
# immunoseq1.txt
# immunoseq2.txt
# immunoseq3.txt
# metadata.txt

With the metadata repLoad will create a list in the environment with 2 elements, namely data and meta. All the data will be accessible simply from immdata$data.

Otherwise repLoad will create a list with the number of elements matching the number of your files. They will be accessible directly from immdata.

# To load the whole folder with every file in it type:

immdata <- repLoad("path/to/your/folder/")

# In order to do that your folder must contain metadata file named
# exactly "metadata.txt".

# In R, when you load your data:
# > immdata <- repLoad("path/to/your/folder/")
# > names(immdata)
# [1] "data" "meta"

# Suppose you do not have "metadata.txt":
# > immdata <- repLoad("path/to/your/folder/")
# > names(immdata)
# [1] "immunoseq_1" "immunoseq_2" "immunoseq_3"

The metadata has to be tab delimited file with first column named “Sample” and any number of additional columns with arbitrary names. The first column should contain base names of files without extensions in your folder.

Sample Sex Age Status
immunoseq_1 M 1 C
immunoseq_2 M 2 C
immunoseq_3 F 3 A

In order to import data from the external databases you have to create a connection to this database and then load the data. Make sure that the table format in your database matches the immunarch’s format.

To illustrate the use of external database, here is an example demonstrating data loading to the local MonetDB database:

# Your list of repertoires in immunarch's format
DATA
# Metadata data frame
META

# Create a temporary directory
dbdir = tempdir()

# Create a DBI connection to MonetDB in the temporary directory.
con = DBI::dbConnect(MonetDBLite::MonetDBLite(), embedded = dbdir)

# Write each repertoire to MonetDB. Each table has corresponding name from the DATA
for (i in 1:length(DATA)) {
  DBI::dbWriteTable(con, names(DATA)[i], DATA[[i]], overwrite=TRUE)
}

# Create a source in the temporary directory with MonetDB
ms = MonetDBLite::src_monetdblite(dbdir = dbdir)
res_db = list()

# Load the data from MonetDB to dplyr tables
for (i in 1:length(DATA)) {
  res_db[[names(DATA)[i]]] = dplyr::tbl(ms, names(DATA)[i])
}

# Your data is ready to use
list(data = res_db, meta = META)

You might want to make a list with data, which is a list of tables of repertoires, and the meta, containing metadata:

# Load the data to the immdata variables. Metadata file "metadata.txt" will be found automatically.
immdata = repLoad("your_folder", "optionally_your_format")
# Repertoires
immdata$data
# Metadata
immdata$meta

immunarch is compatible with following sources:

  • R data frames (for common applications)

  • R data tables (for speed, although requires lots of RAM)

  • MonetDB-like databases that support both DBI and dplyr (best choice, although you have to be a bit familiar with dplyr)

  • Apache Spark (if you are expierienced and comfortable with it)

Basic data manipulations with dplyr and immunarch

You can find the introduction to dplyr here: https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

Get the most abundant clonotypes

The function returns the most abundant clonotypes for the given repertoire:

top(immdata$data[[1]])
## # A tibble: 10 x 15
##    Clones Proportion CDR3.nt CDR3.aa V.name D.name J.name V.end D.start
##     <dbl>      <dbl> <chr>   <chr>   <chr>  <chr>  <chr>  <int>   <int>
##  1   1647    0.0206  TGCGCC… CASSQE… TRBV4… TRBD1  TRBJ2…    16      18
##  2   1409    0.0176  TGCGCC… CASSYR… TRBV4… TRBD1  TRBJ2…    11      13
##  3    747    0.00934 TGTGCC… CATSTN… TRBV15 TRBD1  TRBJ2…    11      16
##  4    511    0.00639 TGCGCC… CASQGD… TRBV4… TRBD1  TRBJ1…     8      13
##  5    463    0.00579 TGTGCC… CATSIG… TRBV15 TRBD2  TRBJ2…    11      19
##  6    450    0.00562 TGTGCC… CASSPW… TRBV27 TRBD1  TRBJ1…    11      16
##  7    336    0.0042  TGTGCC… CASSEE… TRBV2  TRBD1  TRBJ1…    15      17
##  8    335    0.00419 TGCGCC… CASSQD… TRBV4… TRBD1  TRBJ2…    16      21
##  9    287    0.00359 TGTGCC… CASSWV… TRBV6… TRBD1  TRBJ2…    12      20
## 10    268    0.00335 TGCGCC… CASSQD… TRBV4… TRBD1  TRBJ2…    16      25
## # … with 6 more variables: D.end <int>, J.start <int>, VJ.ins <dbl>,
## #   VD.ins <dbl>, DJ.ins <dbl>, Sequence <lgl>

Filter functional / non-functional / in-frame / out-of-frame clonotypes

Conveniently, functions are vectorised over the list of data frames; and coding(immdata$data) in the example below returns a list of data frames with coding sequences:

coding(immdata$data[[1]])

The next one operates in a similar fashion:

noncoding(immdata$data[[1]])

Now, the computation of the number of filtered sequences is straightforward:

nrow(inframes(immdata$data[[1]]))

And for the out-of-frame clonotypes:

nrow(outofframes(immdata$data[[1]]))

Get subset of clonotypes with a specific V gene

It is simple to subset data frame according to labels in the specified index. In the example the resulting data frame contains only records with ‘TRBV10-1’ V gene:

filter(immdata$data[[1]], V.name == 'TRBV10-1')
## # Source:   lazy query [?? x 15]
## # Database: MonetDBEmbeddedConnection
##    Clones Proportion CDR3.nt CDR3.aa V.name D.name J.name V.end D.start
##     <dbl>      <dbl> <chr>   <chr>   <chr>  <chr>  <chr>  <int>   <int>
##  1      7  0.0000875 TGCGCC… CASSES… TRBV1… TRBD2  TRBJ2…    16      20
##  2      5  0.0000625 TGCGCC… CASSDG… TRBV1… TRBD1  TRBJ2…    13      15
##  3      4  0.00005   TGCGCC… CASSEA… TRBV1… TRBD2  TRBJ2…    14      21
##  4      4  0.00005   TGCGCC… CASSEG… TRBV1… TRBD2  TRBJ2…    14      15
##  5      4  0.00005   TGCGCC… CASSAP… TRBV1… TRBD2  TRBJ2…    12      13
##  6      4  0.00005   TGCGCC… CATLRS… TRBV1… TRBD1  TRBJ2…     6       7
##  7      3  0.0000375 TGCGCC… CASSES… TRBV1… TRBD2  TRBJ2…    16      20
##  8      3  0.0000375 TGCGCC… CASSES… TRBV1… TRBD2  TRBJ2…    16      17
##  9      2  0.000025  TGCGCC… CASSGE… TRBV1… TRBD2  TRBJ2…    12      13
## 10      2  0.000025  TGCGCC… CASSES… TRBV1… TRBD2  TRBJ2…    13      15
## # … with more rows, and 6 more variables: D.end <int>, J.start <int>,
## #   VJ.ins <dbl>, VD.ins <dbl>, DJ.ins <dbl>, Sequence <lgl>

Analysis and visualisation

Basic analysis

For each task in this section immunarch includes separate functions that are generally self-explanatory and are written in camel-case. Shorter names for the functions are also availbalie, although the authors do not recommend their usage for the sake of clarity and readability of the code. The latter are mentionned in parantheses. Basic analysis functions are:

  • repExplore (rep.ex) - to compute basic statistics, such as number of clones or distributions of lengths and counts. To explore them you need to pass the statistics, e.g. count, to the .method.

  • repClonality (rep.clo) - to compute the clonality of repertoires.

  • repOverlap (rep.ov) - to compute the repertoire overlap.

  • repOverlapAnalysis (rep.ova) - to analyse the repertoire overlap, including different clustering procedures and PCA.

  • geneUsage (gen.us) - to compute the distributions of V or J genes.

  • geneUsageAnalysis (gen.usa) - to analyse the distributions of V or J genes, including clustering and PCA.

  • repDiversity (rep.div) - to estimate the diversity of repertoires.

  • trackClonotypes (tr.clo) - to analyse the dynamics of repertoires across time points.

  • spectratype - to compute spectratype of clonotypes.

Output of each function could be passed directly to the vis function - the general function for visualisation. Examples of usage are written below.

Plots generated by the vis function can be passed to fixVis - built-in software tool for making publication-ready plots. See the “Quick start” section for usage.

Exploratory analysis

For the basic exploratory analysis such as comparing of number of reads / UMIs per repertoire or distribution use the function repExplore.

exp_len = repExplore(immdata$data, .method = "len", .col = "aa")
exp_cnt = repExplore(immdata$data, .method = "count")
exp_vol = repExplore(immdata$data, .method = "volume")

p1 = vis(exp_len)
p2 = vis(exp_cnt)
p3 = vis(exp_vol)

p1

grid.arrange(p2, p3, ncol = 2)

# You can group samples by their metainformation
p4 = vis(exp_len, .by="Status", .meta=immdata$meta)
p5 = vis(exp_cnt, .by="Sex", .meta=immdata$meta)
p6 = vis(exp_vol, .by=c("Status", "Sex"), .meta=immdata$meta)

p4

grid.arrange(p5, p6, ncol = 2)

Clonality

One of the ways to estimate the diversity of samples is to evaluate clonality. repClonality measures the amount of the most or the least frequent clonotypes. There are several methods to assess clonality, let us take a view of them. The clonal.prop method computes the proportion of repertoire occupied by the pools of cell clones:

imm_pr = repClonality(immdata$data, .method = "clonal.prop")
imm_pr
##         Clones Percentage Clonal.count.prop
## A2-i129     18       10.1      4.195413e-04
## A2-i131     31       10.1      6.902387e-04
## A2-i133      8       10.3      1.947799e-04
## A2-i132    140       10.0      3.323284e-03
## A4-i191      4       11.5      1.209995e-04
## A4-i192      8       10.2      2.274472e-04
## MS1          3       12.6      8.001067e-05
## MS2         74       10.0      1.580656e-03
## MS3          2       10.8      3.762015e-05
## MS4        279       10.0      5.344828e-03
## MS5          2       10.3      5.141917e-05
## MS6        205       10.0      3.750114e-03
## attr(,"class")
## [1] "matrix"             "immunr_clonal_prop"

The top method considers the most abundant cell clonotypes:

imm_top = repClonality(immdata$data, .method = "top")
imm_top
##                10       100      1000      3000     10000     30000 1e+05
## A2-i129 0.0806625 0.1661375 0.2869000 0.3840000 0.5676000 0.8387000     1
## A2-i131 0.0666625 0.1506875 0.2912000 0.3928875 0.5636000 0.8136000     1
## A2-i133 0.1101500 0.1820250 0.3085375 0.4093875 0.5957375 0.8616000     1
## A2-i132 0.0238500 0.0854250 0.2423750 0.3728625 0.5723125 0.8484125     1
## A4-i191 0.1729375 0.3158875 0.4576250 0.5482750 0.7117750 0.9617750     1
## A4-i192 0.1142750 0.2141250 0.3629250 0.4803500 0.6744500 0.9353375     1
## MS1     0.1957000 0.3068250 0.4328750 0.5132625 0.6563125 0.9063125     1
## MS2     0.0455875 0.1091125 0.2059000 0.2919125 0.4797750 0.7898000     1
## MS3     0.1673750 0.2293625 0.2969875 0.3522750 0.4604625 0.7104625     1
## MS4     0.0191000 0.0626000 0.1737000 0.2684125 0.4537375 0.7225000     1
## MS5     0.2093375 0.2884500 0.3949000 0.4813500 0.6388000 0.8888000     1
## MS6     0.0296375 0.0750125 0.1861625 0.2783250 0.4416875 0.6916875     1
## attr(,"class")
## [1] "matrix"          "immunr_top_prop"

While the tail method deals with the least prolific clonotypes:

imm_tail = repClonality(immdata$data, .method = "tail")
imm_tail
##                 1         3        10        30       100
## A2-i129 0.3902000 0.6478375 0.7675125 0.8353500 0.8743125
## A2-i131 0.4488875 0.6335125 0.7579000 0.8353625 0.9059250
## A2-i133 0.3725375 0.6166750 0.7446000 0.8172000 0.8704750
## A2-i132 0.3754875 0.6225375 0.7950750 0.9094500 0.9705000
## A4-i191 0.3015000 0.4948500 0.5897750 0.6648000 0.7293250
## A4-i192 0.3037750 0.5276000 0.6829875 0.7731625 0.8326125
## MS1     0.3756375 0.5337000 0.6197000 0.6793500 0.7409375
## MS2     0.4001750 0.7526375 0.8507500 0.9006875 0.9349875
## MS3     0.6063500 0.7091250 0.7536625 0.7829500 0.8027875
## MS4     0.5087375 0.7636875 0.8809875 0.9487375 0.9844625
## MS5     0.3787500 0.5660625 0.6630625 0.7151625 0.7536000
## MS6     0.5722375 0.7613500 0.8715875 0.9367875 0.9690500
## attr(,"class")
## [1] "matrix"           "immunr_tail_prop"

Finally, the homeo method assesses the clonal space homeostasis, i.e., the proportion of the repertoire occupied by the clones of a given size:

imm_hom = repClonality(immdata$data, .method = "homeo")
imm_hom
##         Rare (0 < X <= 1e-05) Small (1e-05 < X <= 1e-04)
## A2-i129                     0                  0.7479750
## A2-i131                     0                  0.7390625
## A2-i133                     0                  0.7246750
## A2-i132                     0                  0.7644250
## A4-i191                     0                  0.5739125
## A4-i192                     0                  0.6598500
## MS1                         0                  0.6038500
## MS2                         0                  0.8407250
## MS3                         0                  0.7473000
## MS4                         0                  0.8653125
## MS5                         0                  0.6467750
## MS6                         0                  0.8531875
##         Medium (1e-04 < X <= 0.001) Large (0.001 < X <= 0.01)
## A2-i129                   0.1217875                 0.0920375
## A2-i131                   0.1504750                 0.0982750
## A2-i133                   0.1334375                 0.0651750
## A2-i132                   0.1959250                 0.0396500
## A4-i191                   0.1396000                 0.1568250
## A4-i192                   0.1624375                 0.0844625
## MS1                       0.1223875                 0.0972250
## MS2                       0.0863375                 0.0601875
## MS3                       0.0519500                 0.0630625
## MS4                       0.1155875                 0.0191000
## MS5                       0.1024625                 0.0662000
## MS6                       0.1102750                 0.0365375
##         Hyperexpanded (0.01 < X <= 1)
## A2-i129                     0.0382000
## A2-i131                     0.0121875
## A2-i133                     0.0767125
## A2-i132                     0.0000000
## A4-i191                     0.1296625
## A4-i192                     0.0932500
## MS1                         0.1765375
## MS2                         0.0127500
## MS3                         0.1376875
## MS4                         0.0000000
## MS5                         0.1845625
## MS6                         0.0000000
## attr(,"class")
## [1] "matrix"       "immunr_homeo"
grid.arrange(vis(imm_top), vis(imm_top, .by="Status", .meta=immdata$meta), ncol = 2)

grid.arrange(vis(imm_tail), vis(imm_tail, .by="Status", .meta=immdata$meta), ncol = 2)

grid.arrange(vis(imm_hom), vis(imm_hom, .by="Status", .meta=immdata$meta), ncol = 2)

Overlap and public repertoire

Repertoire overlap is the most common approach to measure repertoire similarity. It is achieved by computation of specific statistics on clonotypes shared between given repertoires, also called “public” clonotypes. immunarch provides several indices: - number of public clonotypes (.method = "public") - a classic measure of overlap similarity.

  • overlap coefficient (.method = "overlap") - a normalised measure of overlap similarity. It is defined as the size of the intersection divided by the smaller of the size of the two sets.

  • Jaccard index (.method = "jaccard") - it measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets.

  • Tversky index (.method = "tversky") - an asymmetric similarity measure on sets that compares a variant to a prototype. If using default arguments, it’s similar to Dice’s coefficient.

  • cosine similarity (.method = "cosine") - a measure of similarity between two non-zero vectors

  • Morisita’s overlap index (.method = "morisita") - a statistical measure of dispersion of individuals in a population. It is used to compare overlap among samples.

  • top-overlap - overlaps of the N most abundant clonotypes with sequentially growing N (.method = "top+METHOD")

The function that includes described methods is repOverlap. Again the output is easily visualised when passed to vis() function that does all the work:

imm_ov1 = repOverlap(immdata$data, .method = "public", .verbose = F)
imm_ov2 = repOverlap(immdata$data, .method = "morisita", .verbose = F)

grid.arrange(vis(imm_ov1), vis(imm_ov2, .text.size=1.5), ncol = 2)

vis(imm_ov1, "heatmap2")

To analyse the computed overlap measures function apply repOverlapAnalysis.

# Apply different analysis algorithms to the matrix of public clonotypes:
# "mds" - Multi-dimensional Scaling
repOverlapAnalysis(imm_ov1, "mds")
## Standard deviations:
## [1] 0 0 0 0
## 
## Rotation:
##                [,1]         [,2]
## A2-i129 -110.709050   64.5877923
## A2-i131  -20.688100   29.5930468
## A2-i133 -148.856927 -135.3873972
## A2-i132  271.865359  -12.1997230
## A4-i191   -4.646556  -46.2700405
## A4-i192  -57.459268  156.0677169
## MS1       14.902501   -8.1401151
## MS2       17.526821    0.4664452
## MS3        3.978868   -9.9128964
## MS4       15.476432   -4.0909449
## MS5       17.478676   -4.3784903
## MS6        1.131243  -30.3353937
# "tsne" - t-Stochastic Neighbor Embedding
repOverlapAnalysis(imm_ov1, "tsne")
##                DimI       DimII
## A2-i129 -23.3467632  16.7965447
## A2-i131  -4.4367450 -64.5645986
## A2-i133  65.9439889 -11.9350892
## A2-i132  57.1586279  41.8758166
## A4-i191  20.7809897   4.4325925
## A4-i192  26.2717380 -35.1814953
## MS1     -68.9709820   3.4369293
## MS2     -10.6871143 -12.7217957
## MS3     -38.3194937 -33.1172355
## MS4      -0.4540292  -0.7545221
## MS5     -32.8084991  56.8425185
## MS6       8.8682819  34.8903347
## attr(,"class")
## [1] "matrix"      "immunr_tsne"
# Visualise the results
vis(repOverlapAnalysis(imm_ov1, "mds"))

# Apply different analysis algorithms to the matrix of public clonotypes:
# "mds" - Multi-dimensional Scaling
repOverlapAnalysis(imm_ov1, "mds")
## Standard deviations:
## [1] 0 0 0 0
## 
## Rotation:
##                [,1]         [,2]
## A2-i129 -110.709050   64.5877923
## A2-i131  -20.688100   29.5930468
## A2-i133 -148.856927 -135.3873972
## A2-i132  271.865359  -12.1997230
## A4-i191   -4.646556  -46.2700405
## A4-i192  -57.459268  156.0677169
## MS1       14.902501   -8.1401151
## MS2       17.526821    0.4664452
## MS3        3.978868   -9.9128964
## MS4       15.476432   -4.0909449
## MS5       17.478676   -4.3784903
## MS6        1.131243  -30.3353937
# "tsne" - t-Stochastic Neighbor Embedding
repOverlapAnalysis(imm_ov1, "tsne")
##               DimI       DimII
## A2-i129 -48.081766  24.0410239
## A2-i131 -42.157527 -16.2568961
## A2-i133 -10.058294  45.8984545
## A2-i132  57.148657  -2.3216393
## A4-i191  11.568840  14.6114506
## A4-i192  -9.232778 -19.8147357
## MS1      29.313483  51.6080489
## MS2     -12.395757   9.2580890
## MS3      23.670052 -44.0998498
## MS4      -3.126273  -0.7350184
## MS5      20.759052 -10.7725084
## MS6     -17.407690 -51.4164191
## attr(,"class")
## [1] "matrix"      "immunr_tsne"
# Visualise the results
vis(repOverlapAnalysis(imm_ov1, "mds"))

# Clusterise the MDS resulting components using K-means
vis(repOverlapAnalysis(imm_ov1, "mds+kmeans"))

In order to build a massive table with all clonotypes from the list of repertoires use the pubRep function.

# Pass "nt" as the second parameter to build the public repertoire table using CDR3 nucleotide sequences
pr.nt = pubRep(immdata$data, "nt", .verbose = F)
pr.nt
## Source: local data table [516,021 x 14]
## 
## # A tibble: 516,021 x 14
##    CDR3.nt Samples `A2-i129` `A2-i131` `A2-i133` `A2-i132` `A4-i191`
##    <chr>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
##  1 TGCGCC…      12        15         6         6         7        10
##  2 TGCGCC…      12        12         4        11         8         3
##  3 TGTGCC…      12         3         4         3        14         5
##  4 TGCGCC…      11         7         1         5         3         2
##  5 TGTGCC…      11        13         9        14        22        12
##  6 TGTGCC…      11         1         2        NA         1         4
##  7 TGTGCC…      11         4         7         1         5         3
##  8 TGTGCC…      11         5         2         8         8         2
##  9 TGTGCC…      10         3         3         2         7         3
## 10 TGTGCC…      10         5         1         5         5         3
## # … with 516,011 more rows, and 7 more variables: `A4-i192` <dbl>,
## #   MS1 <dbl>, MS2 <dbl>, MS3 <dbl>, MS4 <dbl>, MS5 <dbl>, MS6 <dbl>
# Pass "aa+v" as the second parameter to build the public repertoire table using CDR3 aminoacid sequences and V alleles
# In order to use only CDR3 aminoacid sequences, just pass "aa"
pr.aav = pubRep(immdata$data, "aa+v", .verbose = F)
pr.aav
## Source: local data table [505,701 x 15]
## 
## # A tibble: 505,701 x 15
##    CDR3.aa V.name Samples `A2-i129` `A2-i131` `A2-i133` `A2-i132` `A4-i191`
##    <chr>   <chr>    <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
##  1 CASSFQ… TRBV5…      12        21         6         7         7        11
##  2 CASSLE… TRBV5…      12        19         5        14        10         3
##  3 CASSLE… TRBV5…      11         1         2         6         1         1
##  4 CASSLG… TRBV1…      11        24         7         2        29        15
##  5 CASSLG… TRBV5…      11        14         2         6         9         3
##  6 CASSLG… TRBV5…      11         1         1         1         3         2
##  7 CASSLQ… TRBV5…      11        16         3         2         6         1
##  8 CASSSQ… TRBV5…      11        11         1         6         3         5
##  9 CAS~FF  TRBV1…      11        14        19        10        29         2
## 10 CASSFG… TRBV1…      10         4         2         7        22         2
## # … with 505,691 more rows, and 7 more variables: `A4-i192` <dbl>,
## #   MS1 <dbl>, MS2 <dbl>, MS3 <dbl>, MS4 <dbl>, MS5 <dbl>, MS6 <dbl>
# You can also pass the ".coding" parameter to filter out all noncoding sequences first:
pr.aav.cod = pubRep(immdata$data, "aa+v", .coding=T)
# Create a public repertoire with coding-only sequences using both CDR3 amino acid sequences and V genes
pr = pubRep(immdata$data, "aa+v", .coding = T, .verbose = F)

# Apply the filter subroutine to leave clonotypes presented only in healthy individuals
pr1 = pubRepFilter(pr, immdata$meta, c(Status = "C"))

# Apply the filter subroutine to leave clonotypes presented only in diseased individuals
pr2 = pubRepFilter(pr, immdata$meta, c(Status = "MS"))

# Divide one by another
pr3 = pubRepApply(pr1, pr2)

# Plot it
p = ggplot() + geom_jitter(aes(x = "Treatment", y = Result), data=pr3)
p

Gene usage

immunarch comes with a gene segments data table containing known gene segments for several species. In order to get the current statistics of genes, call the gene_stats() function:

gene_stats()
##     alias                 species ighd ighj ighv igij igkj igkv iglj iglv
## 1      bt               BosTaurus    0    0    0    0    0    0    0    0
## 2      cd      CamelusDromedarius    0    0    0    0    0    0    0    0
## 3     clf    CanisLupusFamiliaris    0    0    0    0    0    0    0    0
## 4      dr              DanioRerio    7    7    0    3    0    0    0    0
## 5      hs             HomoSapiens   30   13  224    0    5   64    7   69
## 6  macmul           MacacaMulatta   24    7   19    0    4   83    5    0
## 7     mmc    MusMusculusCastaneus    0    0    0    0    0    4    0    0
## 8     mmd   MusMusculusDomesticus    0    0    0    0    0    2    0    0
## 9  musmus             MusMusculus   27    8  234    0    8  109    3    5
## 10     oa OrnithorhynchusAnatinus    3   10    0    0    0    0    0    0
## 11     oc    OryctolagusCuniculus   10   11   39    0    8   26    2   20
## 12     om      OncorhynchusMykiss    9    7    6    0    0    0    0    0
## 13     rn        RattusNorvegicus   30    4  113    0    6  132    2    8
## 14   smth   MusMusculusMolossinus    0    0    0    0    0    1    0    0
## 15   smth     MusMusculusMusculus    0    0    0    0    0    1    0    0
## 16   smth              MusSpretus    0    0    0    0    0    2    0    2
## 17     ss               SusScrofa    5    5   15    0    8   19    4   14
##    traj trav trbd trbj trbv trdd trdj trdv trgj trgv vprv
## 1    46    0    0    0    0    5    3    2    6   14    0
## 2     0    0    0    0    0    0    0    7    2    2    0
## 3     0    0    2    8   19    0    0    0    7    8    0
## 4     0    0    0    0    0    0    0    0    0    0    0
## 5    52   47    3   14   60    3    4    6    3    9    1
## 6     0    0    2   15   58    0    0    0    0    0    0
## 7     0    0    0    0    0    0    0    0    0    0    0
## 8     0    0    0    0    0    0    0    0    0    0    0
## 9    41  145    2   14   23    2    3    7    0   11    0
## 10    0    0    0    0    0    0    0    0    0    0    0
## 11    0    0    0    0    0    0    0    0    0    0    0
## 12    0    0    1    9    0    0    0    0    0    0    0
## 13    0    0    0    0    0    0    0    0    0    0    0
## 14    0    0    0    0    0    0    0    0    0    0    0
## 15    0    0    0    0    0    0    0    0    0    0    0
## 16    0    0    0    0    0    0    0    0    0    0    0
## 17    0    0    0    0    0    0    0    0    0    0    0

To compute the distributions of genes, immunarch includes the geneUsage function. It receives a repertoire or a list of repertoires as input and genes and species for which you want to get the statistics. E.g., if you plan to use TRBV genes of Homo Sapiens, you need to use the hs.trbv string in the function, where hs comes from the alias column and trbv is the gene name. Of if you plan to use IGHV genes of Mus Musculus, you need to use musmus.ighv:

# Next four function calls are equal. "hs" is from the "alias" column.
imm_gu = geneUsage(immdata$data, "hs.trbv")
# imm_gu = geneUsage(immdata$data, "HomoSapiens.trbv")
# imm_gu = geneUsage(immdata$data, "hs.TRBV")
# imm_gu = geneUsage(immdata$data, "HomoSapiens.TRBV")

imm_gu
## # A tibble: 48 x 13
##    Names `A2-i129` `A2-i131` `A2-i133` `A2-i132` `A4-i191` `A4-i192`   MS1
##    <chr>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl> <dbl>
##  1 TRBV…       296       357        69       137       163        90   171
##  2 TRBV…      1491      2032       864       783      2401      1353  1329
##  3 TRBV…       134        91       144       142       140        95   113
##  4 TRBV…      1078      1210       874      1077       674       660   542
##  5 TRBV…        64       104        56       145        31        28    54
##  6 TRBV…      3980      3190      2222      2749      2228      3513  2772
##  7 TRBV…       285       404        76       314        99       109    84
##  8 TRBV…       293       334       238       391       187       200   193
##  9 TRBV…       444       626       413       384       218       210   160
## 10 TRBV…       699      1083       746       541       468       378   790
## # … with 38 more rows, and 5 more variables: MS2 <dbl>, MS3 <dbl>,
## #   MS4 <dbl>, MS5 <dbl>, MS6 <dbl>

Gene distributions could be computed either using counts of individual clonotypes (.quant = "count") or not using them (.quant = NA).

In order to compute allele-level or family-level distributions, change the .type parameter.

Due to the ambiguity of gene alignments for some clonotypes, geneUsage has the following options to deal with ambiguous data:

  • .ambig = "exc" - filters out all clonotypes with ambiguous gene alignments.

  • .ambig = "inc" - includes all possible combinations of ambiguous gene alignments from the data. NOTE: ImmunoSEQ formats use non-standart gene segment names, so it is preferable to use this argument value with ImmunoSEQ formats. This argument is ON by default to ease the gene manipulation. Feel free to change it to "exc" in case of other data formats.

  • .ambig = "wei" - introduces weighted approach (divides by n (1/n) the frequency for each entry of the corresponding gene if there are n genes for a clonotype).

  • .ambig = "maj" - chooses only the first gene segment.

Parameter .norm controls whether immunarch will normalise the data to ensure the sum of all frequencies to be equal 1 or not.

You can visualise the histogram of gene usage in two different ways:

imm_gu = geneUsage(immdata$data, "hs.trbv", .norm = T, .ambig = "exc")

vis(imm_gu, .plot = "hist", .grid = T)

vis(imm_gu, .plot = "hist", .grid = F, .by = "Status", .meta = immdata$meta)

Another practical approach to the visualisation of group distributions are box plots:

vis(imm_gu, .by = "Status", .meta = immdata$meta, .plot = "box")

Sometimes tree maps could be used to reveal the differences in repertoires. They display the overall picture and the comparisons of related items, both at the same time, enabling intelligible exploration of the details:

imm_gu = geneUsage(immdata$data, "hs.trbv", .norm = T, .ambig = "exc")

vis(imm_gu, .plot = "tree")

To analyse the gene usage immunarch introduces the geneUsageAnalysis function. The .method parameter controls how the data is going to be preprocessed and analysed. geneUsageAnalysis includes following methods for preprocessing:

  • “js” - Jensen-Shannon Divergence.

  • “cor” - correlation.

  • “cosine” - cosine similarity.

  • “pca” - principal component analysis.

  • “mds” - multi-dimensional scaling.

  • “tsne” - t-Distributed Stochastic Neighbor Embedding.

And a few methods for the actual analysis:

  • “hclust” - clusters the data using hierarchical clustering.

  • “kmeans” - clusters the data using K-means.

  • “dbscan” - clusters the data using DBSCAN.

  • “permut” - permutation testing of differences between groups.

  • “anova” - computes ANOVA for each gene separately on data splitted to groups (without preprocessing). Results could be used with Tukey post-hoc test in order to detect significant differences among groups per gene.

  • “kruskall” - compute Kruskall for each gene separately on data splitted to groups (without preprocessing). Results could be used with Dunn test in order to detect significant differences between groups.

You can call several methods in a single line of code, which is probably the most powerful feature of the package. For instance, "js+hclust" first computes Jensen-Shannon divergence and then applies hierarchical clustering on the resulting distance matrix, whereas "anova" computes ANOVA on each gene separately after repertoires have been grouped:

imm_gu = geneUsage(immdata$data, "hs.trbv", .norm = T, .ambig = "exc")

imm_gu_js = geneUsageAnalysis(imm_gu, .method = "js", .verbose = F)
imm_gu_cor = geneUsageAnalysis(imm_gu, .method = "cor", .verbose = F)

gridExtra::grid.arrange(vis(imm_gu_js, .title = "Gene usage JS-divergence", .leg.title = "JS", .text.size=1.5), vis(imm_gu_cor, .title = "Gene usage correlation", .leg.title = "Cor", .text.size=1.5), ncol = 2)

Now let us visualise the output after both preprocessing and analysis:

imm_gu_js[is.na(imm_gu_js)] = 0

vis(geneUsageAnalysis(imm_gu, "cosine+hclust", .verbose = F))

#vis(geneUsageAnalysis(imm_gu, "js+dbscan", .verbose = F))

On top of that you can add clustering:

imm_cl_pca = geneUsageAnalysis(imm_gu, "js+pca+kmeans", .verbose = F)
imm_cl_mds = geneUsageAnalysis(imm_gu, "js+mds+kmeans", .verbose = F)
imm_cl_tsne = geneUsageAnalysis(imm_gu, "js+tsne+kmeans", .perp = .01, .verbose = F)
## Perplexity should be lower than K!
grid.arrange(vis(imm_cl_pca, .plot = "clust"), vis(imm_cl_mds, .plot = "clust"), vis(imm_cl_tsne, .plot = "clust"), ncol = 3)

Diversity

Sevral approaches to the estimation of repertoire diversity are implemented in the repDiversity function. The .method parameter similarly to abovementionned functions sets the means for diversity estimation. You can choose one of the following methods:

  • chao1 - Chao1 estimator is a nonparameteric asymptotic estimator of species richness (number of species in a population).

  • hill - Hill numbers are a mathematically unified family of diversity indices (differing only by an exponent q).

  • div - True diversity, or the effective number of types, refers to the number of equally-abundant types needed for the average proportional abundance of the types to equal that observed in the dataset of interest where all types may not be equally abundant.

  • gini.simp - The Gini-Simpson index is the probability of interspecific encounter, i.e., probability that two entities represent different types.

  • inv.simp - Inverse Simpson index is the effective number of types that is obtained when the weighted arithmetic mean is used to quantify average proportional abundance of types in the dataset of interest.

  • gini - The Gini coefficient measures the inequality among values of a frequency distribution (for example levels of income). A Gini coefficient of zero expresses perfect equality, where all values are the same (for example, where everyone has the same income). A Gini coefficient of one (or 100 percents ) expresses maximal inequality among values (for example where only one person has all the income).

  • raref - Rarefaction is a technique to assess species richness from the results of sampling through extrapolation.

# Compute statistics and visualise them
# Chao1 diversity measure
div_chao = repDiversity(immdata$data, "chao1")

# Hill numbers
div_hill = repDiversity(immdata$data, "hill")

# D50
div_d50 = repDiversity(immdata$data, "d50")

# Ecological diversity measure
div_div = repDiversity(immdata$data, "div")

p1 = vis(div_chao)
p2 = vis(div_chao, .by=c("Status", "Sex"), .meta=immdata$meta)
p3 = vis(div_hill, .by=c("Status", "Sex"), .meta=immdata$meta)

p4 = vis(div_d50)
p5 = vis(div_d50, .by="Status", .meta=immdata$meta)
p6 = vis(div_div)

gridExtra::grid.arrange(p1, p2, ncol = 2)

gridExtra::grid.arrange(p3, p6, ncol = 2)

gridExtra::grid.arrange(p4, p4, ncol = 2)

imm_raref = repDiversity(immdata$data, "raref", .verbose = F)

grid.arrange(vis(imm_raref), vis(imm_raref, .by="Status", .meta=immdata$meta), ncol=2)

Dynamics

Tracking the dynamics of clonotypes across time points in different repertoires is a very useful tool in medicine. It is possible to track clonotypes based on their cdr3 sequences with trackClonotypes function, which will be applied to those defined by .which parameter. The .which parameter takes one of the three kinds of input:

  • .which input can be a vector of length two c(X, Y), where Y is the number of the most abundant clonotypes to take from X’th data frame in the input list.

  • .which input can be a character vector of sequences to take from all data frames.

  • .which input can be a data frame with two or three columns - first for the sequences, and second/third for the gene segments.

In order to estimate possible noise in the data immunarch uses two simple model based on either poisson distribution (.model = "poisson") or on normal distribution (.method = "norm"). Pass the none string to use no model.

# imm_tr = 1
  
# grid.arrange(vis(imm_tr, .plot = "line"), vis(imm_tr, .plot = "area"), ncol = 2)

Spectratyping

Spectratype is a useful way to represent distributions of genes per length. Parameter .quant controls the quantity that used to compute proportions of genes - either by clonotype (id) or by number of clones per clonotype (count)

p1 = vis(spectratype(immdata$data[[1]], .quant = "id", .col = "aa", .gene = "v"))
p2 = vis(spectratype(immdata$data[[1]], .quant = "count", .col = "aa", .gene = "v"))

grid.arrange(p1, p2, ncol = 2)

Interactive graphics with plotly

The plotly package allows plotting of interactive graphics directly from the ggplot2 objects, which could be more useful than static plots in some cases.

require(plotly)

Clonality

imm_top = repClonality(immdata$data, .method = "top")
plotly::ggplotly(vis(imm_top))
## Using Sample as id variables

Gene usage

imm_gu = geneUsage(immdata$data[[1]], "hs.trbv", .norm = T, .ambig = "exc")
plotly::ggplotly(vis(imm_gu, .plot = "hist", .ncol = 2, .grid = T))
## Using Names as id variables
imm_gu = geneUsage(immdata$data[c(5,6,7,8,9)], "hs.trbv", .norm = T, .ambig = "exc")
plotly::ggplotly(vis(imm_gu, .plot = "hist", .grid = F))
## Using Names as id variables
# plotly::ggplotly(vis(imm_gu, .by = "Status", .meta = immdata$meta, .plot = "box", .violin = F))
p1 = vis(spectratype(immdata$data[[1]], .quant = "id", .col = "aa", .gene = "v"))
plotly::ggplotly(p1)

License

The package is freely distributed under the GPL v3 license. You can read more about it here.

Additionally, we provide an annual subscription that includes next services:

Contact us at support@immunomind.io for more information.